OCR Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A Real-World Noisy Unstructured Handwritten Notebook Corpus for Document Image Analysis Research

Internal identifier: 000336 (Main/Exploration); previous: 000335; next: 000337

Authors: Jin Chen [United States]; Daniel Lopresti [United States]; Bart Lamiroy [France]

RBID: Hal:inria-00627844

Abstract

Traditionally, document image analysis (DIA) is conducted on datasets that are prepared for research purposes. Many existing handwriting datasets, however, do not necessarily represent the range of problems we wish to solve in real life. In this work, we introduce a noisy and unstructured handwriting dataset that aims to promote and evaluate robust document analysis algorithms for real-world challenges by emphasizing the process of building and curating a dataset. First, we explain the data acquisition process and characterize its critical features as noisy and unstructured. Then, we discuss a set of real-world scenarios that might benefit from using our notebook dataset. As an ongoing activity, we have so far collected 18 handwritten notebooks from nine college students, for a total of 499 pages. We expect to collect over 100 notebooks, or roughly 3,000 pages, from at least 50 students. This dataset is available to the research community via the Lehigh document analysis and exploitation (DAE) platform.

URL: https://hal.inria.fr/inria-00627844
DOI: 10.1145/2034617.2034620



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A Real-World Noisy Unstructured Handwritten Notebook Corpus for Document Image Analysis Research</title>
<author>
<name sortKey="Chen, Jin" sort="Chen, Jin" uniqKey="Chen J" first="Jin" last="Chen">Jin Chen</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-21735" status="VALID">
<orgName>Computer Science &amp; Engineering Department</orgName>
<orgName type="acronym">CSE</orgName>
<desc>
<address>
<addrLine>P.C. Rossin College of Engineering &amp; Applied Science - Computer Science and Engineering - Packard Laboratory, 19 Memorial Drive West - Lehigh University, Bethlehem PA 18015</addrLine>
<country key="US"></country>
</address>
<ref type="url">http://www.cse.lehigh.edu/</ref>
</desc>
<listRelation>
<relation active="#struct-301550" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-301550" type="direct">
<org type="institution" xml:id="struct-301550" status="VALID">
<orgName>Lehigh University, Bethlehem, USA</orgName>
<desc>
<address>
<country key="US"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Lopresti, Daniel" sort="Lopresti, Daniel" uniqKey="Lopresti D" first="Daniel" last="Lopresti">Daniel Lopresti</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-21735" status="VALID">
<orgName>Computer Science &amp; Engineering Department</orgName>
<orgName type="acronym">CSE</orgName>
<desc>
<address>
<addrLine>P.C. Rossin College of Engineering &amp; Applied Science - Computer Science and Engineering - Packard Laboratory, 19 Memorial Drive West - Lehigh University, Bethlehem PA 18015</addrLine>
<country key="US"></country>
</address>
<ref type="url">http://www.cse.lehigh.edu/</ref>
</desc>
<listRelation>
<relation active="#struct-301550" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-301550" type="direct">
<org type="institution" xml:id="struct-301550" status="VALID">
<orgName>Lehigh University, Bethlehem, USA</orgName>
<desc>
<address>
<country key="US"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Lamiroy, Bart" sort="Lamiroy, Bart" uniqKey="Lamiroy B" first="Bart" last="Lamiroy">Bart Lamiroy</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-119680" status="OLD">
<orgName>Querying Graphics through Analysis and Recognition</orgName>
<orgName type="acronym">QGAR</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://qgar.loria.fr</ref>
</desc>
<listRelation>
<relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-160" type="direct">
<org type="laboratory" xml:id="struct-160" status="OLD">
<orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc>
<address>
<addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation>
<relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect">
<org type="institution" xml:id="struct-300291" status="OLD">
<orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect">
<org type="institution" xml:id="struct-300292" status="OLD">
<orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect">
<org type="institution" xml:id="struct-300293" status="OLD">
<orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Lorraine</region>
</placeName>
<orgName type="university">Université Nancy 2</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Lorraine</region>
</placeName>
<orgName type="university">Institut national polytechnique de Lorraine</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:inria-00627844</idno>
<idno type="halId">inria-00627844</idno>
<idno type="halUri">https://hal.inria.fr/inria-00627844</idno>
<idno type="url">https://hal.inria.fr/inria-00627844</idno>
<idno type="doi">10.1145/2034617.2034620</idno>
<date when="2011-09-17">2011-09-17</date>
<idno type="wicri:Area/Hal/Corpus">000008</idno>
<idno type="wicri:Area/Hal/Curation">000008</idno>
<idno type="wicri:Area/Hal/Checkpoint">000083</idno>
<idno type="wicri:Area/Main/Merge">000341</idno>
<idno type="wicri:Area/Main/Curation">000336</idno>
<idno type="wicri:Area/Main/Exploration">000336</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">A Real-World Noisy Unstructured Handwritten Notebook Corpus for Document Image Analysis Research</title>
<author>
<name sortKey="Chen, Jin" sort="Chen, Jin" uniqKey="Chen J" first="Jin" last="Chen">Jin Chen</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-21735" status="VALID">
<orgName>Computer Science &amp; Engineering Department</orgName>
<orgName type="acronym">CSE</orgName>
<desc>
<address>
<addrLine>P.C. Rossin College of Engineering &amp; Applied Science - Computer Science and Engineering - Packard Laboratory, 19 Memorial Drive West - Lehigh University, Bethlehem PA 18015</addrLine>
<country key="US"></country>
</address>
<ref type="url">http://www.cse.lehigh.edu/</ref>
</desc>
<listRelation>
<relation active="#struct-301550" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-301550" type="direct">
<org type="institution" xml:id="struct-301550" status="VALID">
<orgName>Lehigh University, Bethlehem, USA</orgName>
<desc>
<address>
<country key="US"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Lopresti, Daniel" sort="Lopresti, Daniel" uniqKey="Lopresti D" first="Daniel" last="Lopresti">Daniel Lopresti</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-21735" status="VALID">
<orgName>Computer Science &amp; Engineering Department</orgName>
<orgName type="acronym">CSE</orgName>
<desc>
<address>
<addrLine>P.C. Rossin College of Engineering &amp; Applied Science - Computer Science and Engineering - Packard Laboratory, 19 Memorial Drive West - Lehigh University, Bethlehem PA 18015</addrLine>
<country key="US"></country>
</address>
<ref type="url">http://www.cse.lehigh.edu/</ref>
</desc>
<listRelation>
<relation active="#struct-301550" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-301550" type="direct">
<org type="institution" xml:id="struct-301550" status="VALID">
<orgName>Lehigh University, Bethlehem, USA</orgName>
<desc>
<address>
<country key="US"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Lamiroy, Bart" sort="Lamiroy, Bart" uniqKey="Lamiroy B" first="Bart" last="Lamiroy">Bart Lamiroy</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-119680" status="OLD">
<orgName>Querying Graphics through Analysis and Recognition</orgName>
<orgName type="acronym">QGAR</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://qgar.loria.fr</ref>
</desc>
<listRelation>
<relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-160" type="direct">
<org type="laboratory" xml:id="struct-160" status="OLD">
<orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc>
<address>
<addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation>
<relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect">
<org type="institution" xml:id="struct-300291" status="OLD">
<orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect">
<org type="institution" xml:id="struct-300292" status="OLD">
<orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect">
<org type="institution" xml:id="struct-300293" status="OLD">
<orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Lorraine</region>
</placeName>
<orgName type="university">Université Nancy 2</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Lorraine</region>
</placeName>
<orgName type="university">Institut national polytechnique de Lorraine</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
</affiliation>
</author>
</analytic>
<idno type="DOI">10.1145/2034617.2034620</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Traditionally, document image analysis (DIA) is conducted on datasets that are prepared for research purposes. Many existing handwriting datasets, however, do not necessarily represent the range of problems we wish to solve in real life. In this work, we introduce a noisy and unstructured handwriting dataset that aims to promote and evaluate robust document analysis algorithms for real-world challenges by emphasizing the process of building and curating a dataset. First, we explain the data acquisition process and characterize its critical features as noisy and unstructured. Then, we discuss a set of real-world scenarios that might benefit from using our notebook dataset. As an ongoing activity, we have so far collected 18 handwritten notebooks from nine college students, for a total of 499 pages. We expect to collect over 100 notebooks, or roughly 3,000 pages, from at least 50 students. This dataset is available to the research community via the Lehigh document analysis and exploitation (DAE) platform.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
<li>États-Unis</li>
</country>
<region>
<li>Lorraine</li>
</region>
<settlement>
<li>Nancy</li>
</settlement>
<orgName>
<li>Institut national polytechnique de Lorraine</li>
<li>Université Nancy 2</li>
<li>Université de Lorraine</li>
</orgName>
</list>
<tree>
<country name="États-Unis">
<noRegion>
<name sortKey="Chen, Jin" sort="Chen, Jin" uniqKey="Chen J" first="Jin" last="Chen">Jin Chen</name>
</noRegion>
<name sortKey="Lopresti, Daniel" sort="Lopresti, Daniel" uniqKey="Lopresti D" first="Daniel" last="Lopresti">Daniel Lopresti</name>
</country>
<country name="France">
<region name="Lorraine">
<name sortKey="Lamiroy, Bart" sort="Lamiroy, Bart" uniqKey="Lamiroy B" first="Bart" last="Lamiroy">Bart Lamiroy</name>
</region>
</country>
</tree>
</affiliations>
</record>
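
Reading the record programmatically (illustrative sketch)

The record above can be read with a standard XML parser. The following is a minimal sketch in Python, not part of the Dilib toolchain. It works on a trimmed excerpt of the record rather than the full text. The page does not declare the wicri: and hal: prefixes, so the namespace URIs bound on the root element below are placeholders invented for this example; the function name summarize is likewise hypothetical. A strict parser also requires any literal ampersand in text content to be written as &amp;.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the record shown above. The xmlns:wicri and xmlns:hal
# URIs are placeholders (assumptions), added only so the parser accepts the
# prefixed attribute; they are not taken from the original page.
RECORD = """\
<record xmlns:wicri="urn:example:wicri" xmlns:hal="urn:example:hal">
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A Real-World Noisy Unstructured Handwritten Notebook Corpus for Document Image Analysis Research</title>
<author>
<name sortKey="Chen, Jin" first="Jin" last="Chen">Jin Chen</name>
<affiliation wicri:level="1"><country>États-Unis</country></affiliation>
</author>
<author>
<name sortKey="Lopresti, Daniel" first="Daniel" last="Lopresti">Daniel Lopresti</name>
<affiliation wicri:level="1"><country>États-Unis</country></affiliation>
</author>
<author>
<name sortKey="Lamiroy, Bart" first="Bart" last="Lamiroy">Bart Lamiroy</name>
<affiliation wicri:level="1"><country>France</country></affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="RBID">Hal:inria-00627844</idno>
<idno type="doi">10.1145/2034617.2034620</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>
"""


def summarize(record_xml):
    """Return the title, (author, country) pairs, and DOI of one record."""
    root = ET.fromstring(record_xml)
    title_stmt = root.find("./TEI/teiHeader/fileDesc/titleStmt")
    title = title_stmt.findtext("title")
    # Each <author> carries a <name> and a top-level <country> in <affiliation>.
    authors = [
        (author.findtext("name"), author.findtext("affiliation/country"))
        for author in title_stmt.findall("author")
    ]
    # Select the <idno> whose type attribute is exactly "doi".
    doi = root.findtext(".//publicationStmt/idno[@type='doi']")
    return {"title": title, "authors": authors, "doi": doi}


if __name__ == "__main__":
    summary = summarize(RECORD)
    print(summary["title"])
    for name, country in summary["authors"]:
        print(f"  {name} ({country})")
    print("DOI:", summary["doi"])
```

The same function should apply to the full record once the two prefixes are bound to whatever namespace URIs the real export uses, since only unprefixed element names are queried.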

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000336 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000336 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:inria-00627844
   |texte=   A Real-World Noisy Unstructured Handwritten Notebook Corpus for Document Image Analysis Research
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024